Analysis of finance-related Reddit Communities
Overview and Motivation
In January of 2021, a major short squeeze of the stock of video game retailer GameStop and other securities took place[1]. A major driving force behind this event was the subreddit wallstreetbets, where users discussed the market situation around GameStop, focusing on the large short positions major hedge funds held in this stock; a “David versus Goliath” situation unfolded. What followed was a large influx of users to the subreddit who were convinced the stock had to go up and hedge funds had to pay for it.
Several interesting questions regarding the influence of these subreddits on the stock market emerge from this event. In this project, we want to detect communities in finance-related subreddits and mine their sentiment towards the discussed stocks, to measure whether that sentiment has a significant influence on, or can predict, market movements.
Initial Questions
Based on the objectives of our proposal, the initial questions that we tried to answer covered the main aspects of our project: data extraction and storage, community detection, sentiment analysis, and the correlation between sentiment and stock market information. The questions are as follows:
- How can we create communities inside subreddits and extract the stocks discussed inside each community?
- Is it possible to correlate the sentiment towards the stocks discussed in each community to the stock market related data?
The questions evolved during the first stages of data collection and research into algorithms for clustering and sentiment analysis. They became more specific, and new questions were added. The final questions for the project are:
1. How can we identify subcommunities inside a subreddit, and which measures can be used to evaluate them?
A subreddit is already a big community inside Reddit that allows interaction between users. It is important to represent the interactions between users in order to find relations and smaller clusters of users that we can analyze afterwards.
2. Based on the stocks and the sentiments towards them inside the communities, is it possible to find a correlation with the stock prices?
Considering the GameStop event, we would like to find out whether the sentiment that users express towards a stock in the subreddit is related to the real stock prices.
Data
For this project we have two main sources of data, namely Reddit and Yahoo Finance.
Reddit comments
The data that we use from the subreddits consists of comments on posts and the post content. Since the official Reddit API does not allow searching for posts in a specific date range, a request to the Pushshift API is made to get the URLs of the top five submissions per day, whose top 100 comments are subsequently downloaded using a slightly modified version of RedditExtractor’s function reddit_content. These two functions are wrapped inside the function “ETL” to conveniently fetch the data for a given date range.
source("retrieveModules.R")
tmp_stock <- ETL(subreddit_vec = "wallstreetbets",
date_list = Sys.time() - 24*60*60)## [1] "2021-07-05 20:09:26 UTC"
## [1] 1
## [1] 3
## [1] 5
## [1] 6
## [1] 80
The structure of the retrieved comments is presented in the following table:
| id | structure | post_date | comm_date | num_comments | subreddit | upvote_prop | post_score | author | user | comment_score | controversiality | comment | title | post_text | link | domain | URL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12 | 3_2 | 2020-11-26 03:10:23 | 2021-02-11 11:21:31 | 146 | stocks | 0.78 | 51 | BodakBlack | Starxe | 2 | 0 | This aged well. | What is the mystery stock that the Motley Fool is promoting this time? | The Motley Fool can be super click-baity, but they were promoting a double down stock back in February when I was just getting started in stocks, I tried to find out who it was, and it was TTD, which if you dont already know, has gone from 250 at the beginning of the year to 850 now. So I believe them when they say something is a good stock pick. I was just reading a yahoo article about ENG, which I had thrown $500 into without actually bothering to find out what they are first, and an ad came up for the Motley Fool saying 46 Year Old CEO Bets $44.2 Billion on One Stock I clicked it, and after reading through some of their attention grabbing bs, again they said But this 46-year-old CEO is putting over $44 billion on the table to take over the rest of the market, and he isnt stopping any time soon. There cant be that many 46 year old CEOs with that much money, so I dont think Ill actually need to subscribe to their newsletter to find out who it is, like they want me to. My first guess is Zuck but hes 36. Elon is 49. Larry Page is 47. So who is this mystery man and what is this stock that he is so confident in? | https://www.reddit.com/r/stocks/comments/k169a6/what_is_the_mystery_stock_that_the_motley_fool_is/ | self.stocks | https://www.reddit.com/r/stocks/comments/k169a6/what_is_the_mystery_stock_that_the_motley_fool_is/?ref=search_posts |
| 5 | 5 | 2021-02-01 09:00:13 | 2021-02-11 21:47:20 | 3540 | stocks | 0.91 | 211 | AutoModerator | myke_oxbig45 | 1 | 0 | Thoughts on when the BABA antitrust investigation will end? Ive read in a few days, weeks, etc. Going to play some calll options as soon as its over while IV is still relatively low. | r/Stocks Daily Discussion Monday - Feb 01, 2021 | These daily discussions run from Monday to Friday including during our themed posts. Some helpful links: If you have a basic question, for example “what is EPS,” then google “investopedia EPS” and click the investopedia article on it; do this for everything until you have a more in depth question or just want to share what you learned. Please discuss your portfolios in the Rate My Portfolio sticky.. See our past daily discussions here. Also links for: Technicals Tuesday, Options Trading Thursday, and Fundamentals Friday. | https://www.reddit.com/r/stocks/comments/l9y3e7/rstocks_daily_discussion_monday_feb_01_2021/ | self.stocks | https://www.reddit.com/r/stocks/comments/l9y3e7/rstocks_daily_discussion_monday_feb_01_2021/?ref=search_posts |
| 6 | 6 | 2021-02-01 09:00:13 | 2021-02-11 12:06:30 | 3540 | stocks | 0.91 | 211 | AutoModerator | reggiewills | 1 | 0 | LIVE NOW - Hot Stocks Level 2: $SPY, $TESLA, etc. https://youtu.be/AzSdgHWWjUY | r/Stocks Daily Discussion Monday - Feb 01, 2021 | These daily discussions run from Monday to Friday including during our themed posts. Some helpful links: If you have a basic question, for example “what is EPS,” then google “investopedia EPS” and click the investopedia article on it; do this for everything until you have a more in depth question or just want to share what you learned. Please discuss your portfolios in the Rate My Portfolio sticky.. See our past daily discussions here. Also links for: Technicals Tuesday, Options Trading Thursday, and Fundamentals Friday. | https://www.reddit.com/r/stocks/comments/l9y3e7/rstocks_daily_discussion_monday_feb_01_2021/ | self.stocks | https://www.reddit.com/r/stocks/comments/l9y3e7/rstocks_daily_discussion_monday_feb_01_2021/?ref=search_posts |
Each row contains the content of the comment, the user who made it, and the related post, including its content and author. Moreover, the column “structure” defines the position of the comment in the hierarchy of comments belonging to a post.
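As an illustration, the parent of a reply can be derived from its “structure” value by dropping the last level. The helper below is a hypothetical sketch, assuming the underscore-separated encoding shown above (e.g. “3_2” is the second reply to the third top-level comment):

```r
# Sketch (assumed encoding of the "structure" column): dropping the last
# "_<n>" level yields the parent comment's structure; top-level comments
# (no underscore) are left unchanged and connect to the post author instead.
parent_structure <- function(structure) {
  sub("_[0-9]+$", "", structure)
}

parent_structure("3_2")  # "3"
parent_structure("5")    # "5" (top-level comment, no parent comment)
```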
Once a Reddit account is deleted, either the column “user” or the column “author” takes the value “[deleted]”. In this situation it is not possible to uniquely identify users and to create the correct connections between them, which is important for the community detection. To visualize this problem, the following graphics show the percentage of comments per day whose author or user was deleted. In both subreddits, the largest loss of comments originates from deleted authors, with up to 50% of comments affected.
To work around this problem, we decided to delete all comments made by deleted users and renamed deleted authors based on the post.
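A minimal sketch of this clean-up step (column names as in the table above; the renaming scheme is illustrative, not the exact one we used):

```r
library(dplyr)

clean_comments <- function(df) {
  df %>%
    # comments by deleted users cannot be attributed to anyone, so drop them
    filter(user != "[deleted]") %>%
    # deleted post authors are renamed per post, so all comments on the same
    # post still connect to one and the same author node
    mutate(author = if_else(author == "[deleted]",
                            paste0("deleted_author_", link),
                            author))
}
```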
Moreover, we stored the subreddit data in the cloud database service MongoDB Atlas. This way we can reproduce our results from the stored data even if users on Reddit change over time.
The data is stored in a single database named “reddit”, which has two separate collections named “stocks” and “wallstreet” with the raw data. The data is manipulated using the library mongolite.
Stock data
To get the market data for the relevant tickers, the package BatchGetSymbols is used. It takes a list of ticker symbols, a start and an end date, and returns the market data in a clean format.
tickers <- c("GME","TSLA","AMC")
l.out <- BatchGetSymbols(tickers = tickers,
first.date = lubridate::ymd_hms("2021-06-07 00:00:00"),
last.date = lubridate::ymd_hms("2021-06-10 00:00:00"),
freq.data = "daily",
cache.folder = file.path(tempdir(),
'BGS_Cache'))##
## Running BatchGetSymbols for:
## tickers =GME, TSLA, AMC
## Downloading data for benchmark ticker
## ^GSPC | yahoo (1|1) | Not Cached | Saving cache
## GME | yahoo (1|3) | Not Cached | Saving cache - Got 100% of valid prices | Got it!
## TSLA | yahoo (2|3) | Not Cached | Saving cache - Got 100% of valid prices | Youre doing good!
## AMC | yahoo (3|3) | Not Cached | Saving cache - Got 100% of valid prices | Nice!
kbl(l.out$df.tickers)
| price.open | price.high | price.low | price.close | volume | price.adjusted | ref.date | ticker | ret.adjusted.prices | ret.closing.prices |
|---|---|---|---|---|---|---|---|---|---|
| 258.00 | 282.00 | 255.20 | 280.01 | 6051500 | 280.01 | 2021-06-07 | GME | NA | NA |
| 292.00 | 344.66 | 281.00 | 300.00 | 17439100 | 300.00 | 2021-06-08 | GME | 0.0713903 | 0.0713903 |
| 303.12 | 328.00 | 291.51 | 302.56 | 13429300 | 302.56 | 2021-06-09 | GME | 0.0085333 | 0.0085333 |
| 591.83 | 610.00 | 582.88 | 605.13 | 22543700 | 605.13 | 2021-06-07 | TSLA | NA | NA |
| 623.01 | 623.09 | 595.50 | 603.59 | 26053400 | 603.59 | 2021-06-08 | TSLA | -0.0025449 | -0.0025449 |
| 602.17 | 611.79 | 597.63 | 598.78 | 16584600 | 598.78 | 2021-06-09 | TSLA | -0.0079690 | -0.0079690 |
| 52.38 | 59.68 | 51.50 | 55.00 | 349094900 | 55.00 | 2021-06-07 | AMC | NA | NA |
| 57.16 | 60.62 | 52.77 | 55.05 | 214490300 | 55.05 | 2021-06-08 | AMC | 0.0009091 | 0.0009091 |
| 52.20 | 53.39 | 48.12 | 49.34 | 150361300 | 49.34 | 2021-06-09 | AMC | -0.1037239 | -0.1037239 |
Due to its caching functionality, prices have to be fetched only once; subsequent calls first check whether the cached file already contains all the relevant data.
Feature engineering
Sentiment analysis
To examine the correlation between market data and sentiment, this measure has to be extracted from every comment.
For the calculation of the sentiment, the package sentimentr is used. It follows a dictionary-based approach and also takes modifiers and negators into account, which is a major advantage over the package SentimentAnalysis.
To account for the financial jargon used in these subreddits, the Loughran-McDonald dictionary is used for assigning polarity values to the words of a comment. The dictionary also has to be extended with the special slang of r/wallstreetbets.
# adding and preparing the slang terms of wallstreetbets with their corresponding sentiment
wsb_specific_terms <- data.frame(x=c("to the moon", "cant go tits up", "tendies", "diamond hands", "paper hands"), y=c(1,1,1,1,-1)) %>% as_key()
# creating the modified polarity dictionary
modified_dict <- lexicon::hash_sentiment_loughran_mcdonald %>%
update_polarity_table(x=wsb_specific_terms)
# show a sample of positive and negative words of the dictionary
print(modified_dict %>% filter(y>0) %>% sample_n(10))## x y
## 1: leading 1
## 2: pleasant 1
## 3: achievements 1
## 4: excitement 1
## 5: delighted 1
## 6: delightful 1
## 7: premiere 1
## 8: improvement 1
## 9: vibrant 1
## 10: enjoyable 1
print(modified_dict %>% filter(y<0) %>% sample_n(10))## x y
## 1: subjected -1
## 2: litigants -1
## 3: tortuously -1
## 4: misinformed -1
## 5: oversupplying -1
## 6: purport -1
## 7: ceasing -1
## 8: outages -1
## 9: depressed -1
## 10: restatements -1
#wsb <- loadData("reddit", "wallstreetbets")
#stocks <- loadData("reddit", "stocks")
stocks <- readRDS("stocks.Rds")
wsb <- readRDS("wsb.Rds")After adding the sentiment, we take a look at the the most positive and negative comments of both subreddits.
stocks %>% arrange(sentiment) %>% head() %>% select(comment, sentiment) %>%
kbl() %>%
kable_material(c("hover")) %>%
kable_styling(full_width = TRUE)
| comment | sentiment |
|---|---|
| But CNBC said that collusion is totally illegal and should be investigated, except when its idea dinners and behind closed doors!!! | -1.630985 |
| Calls make sense, but why risk getting put’d more worthless shares from this fraud company? | -1.575013 |
| Massive real estate defaults once foreclosure moratoriums end. Especially commerical real estate. | -1.350505 |
| Their argument is hurting their argument. | -1.224745 |
| Far more money has been lost by investors preparing for corrections, or trying to anticipate corrections, than has been lost in corrections themselves. -Peter Lynch | -1.160000 |
| Agree on the short interest and realize its a lagging indicator, but I havent read anything that equates volatility to negative beta. | -1.122683 |
stocks %>% arrange(desc(sentiment)) %>% head() %>% select(comment, sentiment) %>%
kbl() %>%
kable_material(c("hover")) %>%
kable_styling(full_width = TRUE)
| comment | sentiment |
|---|---|
|
My list in rule breakers picks Current Standings Dec 22nd 2020: RB#1: Gain of 238% RB#2: Gain of 105% RB#3: Gain of 62% RB#4: Gain of 351% RB#5: Gain of 194% RB#6: Loss of -14% RB#7: Gain of 128% RB#8: Gain 237% RB#9: Gain 372% RB#10: Gain of 401% RB#11: Gain of 37% RB#12: Gain of 143% RB#13: Gain of 218% RB#14: Gain 362% RB#15: Gain 359% RB#16: Gain 137% RB#17: Gain 178% RB#18: Gain of 45% RB#19: Gain of 10% RB#20: Gain of 106% RB#21: Gain of 106% RB#22: Gain of 17% RB#23: Gain of 34% RB#24: Gain of 3% RB#25: Gain of 20% RB#26: Gain of 49% RB#27: Gain of 19% RB#28: Loss of -12% RB#29: Gain of 52% RB#30: Gain of 55% RB#31: Gain of 17% RB#32: Gain of 33% RB#33: Gain of 77% rule breaker blog |
2.752558 |
| You just get that wish I bought more but gains is gains! | 1.760918 |
| Huge win | 1.272792 |
| more gains | 1.272792 |
| eyo, gains are gains are gains. | 1.224745 |
| Ark funds are more high risk/high reward | 1.202082 |
wsb %>% arrange(sentiment) %>% head() %>% select(comment, sentiment) %>%
kbl() %>%
kable_material(c("hover")) %>%
kable_styling(full_width = TRUE)
| comment | sentiment |
|---|---|
| Very much disagreed | -1.501111 |
| Taint bad | -1.414214 |
| Definitely manipulated | -1.272792 |
|
&#x200B; KNDI is sure not doing NIO any justice. |
-1.166920 |
| ah fail downwards | -1.154700 |
| but the market is closed this is criminal | -1.149048 |
wsb %>% arrange(desc(sentiment)) %>% head() %>% select(comment, sentiment) %>%
kbl() %>%
kable_material(c("hover")) %>%
kable_styling(full_width = TRUE)
| comment | sentiment |
|---|---|
| But better | 1.590990 |
| Easy tendies | 1.414214 |
| Gains are gains get those tendies | 1.224745 |
| Tendies from mom are best tendies | 1.224745 |
| Memes are definitely much better. | 1.162755 |
| Gains are gains | 1.154700 |
Since we are only interested in whether the sentiment towards a stock is positive, negative, or neutral, the values are recoded to 1, -1, and 0 respectively.
wsb <- wsb %>% mutate(sentiment = case_when(sentiment > 0 ~ 1,
sentiment < 0 ~ -1,
sentiment == 0 ~ 0))
stocks <- stocks %>% mutate(sentiment = case_when(sentiment > 0 ~ 1,
sentiment < 0 ~ -1,
sentiment == 0 ~ 0))
Ticker Extraction
To find tickers discussed in a comment, the corresponding symbols have to be extracted.
Two approaches to achieve this have been tried:
- Regex-based extraction:
A regular expression can be used to find ticker mentions inside a comment. Ticker symbols are identified by assuming they are strings of two to four upper-case characters preceded and followed by zero or more non-alphanumeric characters, e.g. " GME." or " CDE".
stringr::str_match_all(c("I bought AMC and BB", "I sold Palantir. PLTR was not the best stock to invest into"), '[^A-Za-z0-9]+([A-Z]{2,4})[^A-Za-z0-9]*')## [[1]]
## [,1] [,2]
## [1,] " AMC " "AMC"
## [2,] " BB" "BB"
##
## [[2]]
## [,1] [,2]
## [1,] ". PLTR " "PLTR"
Disadvantages of this method are:
- False positives like “WSB” for wallstreetbets or “ETF” for exchange-traded fund. A remedy could be an expansion of the regex or a subsequent filter on the most frequently mentioned false positives.
- Actual company names are not recognized, so if “Palantir” gets mentioned, it is not detected as PLTR.
These disadvantages can be remedied by using a dictionary-based approach.
- Dictionary-based extraction:
For this approach, a list of the 100 most actively traded stocks on the NYSE is used.
nyse_lst <- read.csv("nyse_lst.csv")
nyse_lst %>%
head() %>%
kbl() %>%
kable_material(c("hover")) %>%
kable_styling(full_width = TRUE)
| X | symbol | name |
|---|---|---|
| 1 | BLIN | Bridgeline Digital Inc. Common Stock |
| 2 | SPCE | Virgin Galactic Holdings Inc. Common Stock |
| 3 | MRIN | Marin Software Incorporated Common Stock |
| 4 | DIDI | DiDi Global Inc. American Depositary Shares (each four representing one Class A Ordinary Share) |
| 5 | AMC | AMC Entertainment Holdings Inc. Class A Common Stock |
| 6 | SNDL | Sundial Growers Inc. Common Shares |
The names of the companies are also modified: words and phrases like “Ltd.”, “Company”, “Group”, and “Share” have been removed manually to capture the informal phrases one would use to refer to these companies. In addition, tickers consisting of only one character are replaced with this modified name to avoid false matches, e.g. “F” for “Ford”.
nyse_lst_mod <- read.csv("nyse_lst_modified.csv")
nyse_lst_mod %>%
head() %>%
kbl() %>%
kable_material(c("hover")) %>%
kable_styling(full_width = TRUE)
| X | symbol | name |
|---|---|---|
| 1 | BLIN | Bridgeline |
| 2 | SPCE | Virgin |
| 3 | MRIN | Marin |
| 4 | DIDI | DiDi |
| 5 | AMC | AMC |
| 6 | SNDL | Sundial |
To add this information, a helper function was written which adds columns for sentiment and the mentioned tickers respectively.
source("ticker_extraction.R")
example_df <- data.frame(comment = c("I bought AMC and BB", "I sold Palantir. PLTR was not the best stock to invest into but GME was"), user = c("A", "B"))
add_tickers_sentiment(example_df,
ticker_list = nyse_lst_mod)$ticker## [1] 1
## [[1]]
## [1] "AMC" "BB"
##
## [[2]]
## [1] "PLTR" "GME"
Using this approach, the ticker symbols as well as the names of the companies are extracted. In addition, false positives are avoided by only extracting symbols mentioned in the modified list.
Community detection
For the community detection we used the library igraph. For this purpose the subreddit data was transformed into a graph representation: nodes represent the users of the subreddit, while the interactions between users are represented by edges. The interactions are classified into:
1. Answers to a post: direct comments to a post. For this case an edge is created between the user and the author of the post.
2. Answers to a comment: comments made in reply to other comments. In this case an edge is created between the two users, which requires using the hierarchy of comments to find the proper connection.
The weight of the edges is given by the number of interactions between the two users. A visual representation of the graph created by the comments between the dates 2021-01-23 and 2021-01-26 for the subreddit is presented in the following plot.
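The construction described above can be sketched as follows (the column names `from` and `to` for the two users of an interaction are assumptions for this example):

```r
library(dplyr)
library(igraph)

# One row per interaction (commenter -> post author or parent commenter);
# the weight of an edge is the number of interactions between the same pair.
interactions <- data.frame(from = c("userA", "userA", "userB"),
                           to   = c("author1", "author1", "author1"))

g <- interactions %>%
  count(from, to, name = "weight") %>%          # collapse duplicate pairs
  graph_from_data_frame(directed = FALSE)       # "weight" becomes an edge attribute

E(g)$weight  # edge weights: 2 (userA-author1) and 1 (userB-author1)
```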
Clustering Algorithms:
For the community detection we considered three of the algorithms available in the igraph library:
- Louvain: Algorithm based on modularity. It builds a hierarchy of communities while maximizing the modularity value.
- Infomap: Algorithm based on the Map equation.
- Label propagation: the algorithm starts by assigning a unique label to each node and then iteratively propagates labels through the network, each node adopting the most frequent label among its neighbors, until no node needs to update its label.
Since there is no external information about the communities detected, modularity is used as the main measure to evaluate the quality of the clustering. Modularity measures how strong the generated community structure is and can be defined as “a normalized tradeoff between edges covered by clusters and squared cluster degree sums”[5].
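As a toy illustration of this measure (on Zachary's karate-club graph that ships with igraph, not on our data):

```r
library(igraph)

g <- make_graph("Zachary")   # classic 34-node social network
lc <- cluster_louvain(g)     # Louvain community detection
modularity(lc)               # roughly 0.42 for this graph
```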
In order to compare the clustering algorithms, we evaluated all of them on four different samples of the data, as shown in the following table. We used both subreddits, wallstreetbets and stocks, with different sample sizes. For each test we recorded the modularity and the execution time. The results of the evaluation are given in the following plots.
evaluations <- data.frame(
test = 1:4,
collectionName = c("wallstreetbets", "stocks", "wallstreetbets", "stocks"),
initial_time = c("2021-02-03", "2020-11-03", "2021-01-03", "2021-01-03"),
end_time = c("2021-02-05", "2020-11-07", "2021-01-07", "2021-01-13"),
number_comments = rep(NA, 4)
)
result <- data.frame()
for (i in 1:nrow(evaluations)) {
row <- evaluations[i, ]
if(row$collectionName =="wallstreetbets"){
eval_df <- wsb %>%
filter(comm_date>=row$initial_time, comm_date<=row$end_time)
}else{
eval_df <- stocks %>%
filter(comm_date>=row$initial_time, comm_date<=row$end_time)
}
result_evaluation <-
clustering_evaluation(row$test,
"reddit",
row$collectionName,
row$initial_time,
row$end_time,
eval_df)
#update number of comments per test
evaluations[[i, "number_comments"]] <-
result_evaluation$number_comments
result <- rbind(result, result_evaluation$df)
}
kbl(evaluations,
col.names = c("Test",
"Collection Name",
"Initial Date",
"Final Date",
"Sample size"),
caption = "Test samples")
| Test | Collection Name | Initial Date | Final Date | Sample size |
|---|---|---|---|---|
| 1 | wallstreetbets | 2021-02-03 | 2021-02-05 | 311 |
| 2 | stocks | 2020-11-03 | 2020-11-07 | 1106 |
| 3 | wallstreetbets | 2021-01-03 | 2021-01-07 | 1528 |
| 4 | stocks | 2021-01-03 | 2021-01-13 | 4573 |
modularity_clusters <- result %>%
filter(measure == "modularity")
ggplotly(ggplot(data=modularity_clusters, aes(x=id_test))+
geom_bar(aes(y=value,fill=algorithm),
stat = "identity", position = "dodge")+
labs( x = "Test",
y = "Modularity",
title = "Modularity per algorithm for each test"))
In terms of modularity, Louvain performed better than Infomap and label propagation. The highest possible value of modularity is 1.0, attained when all clusters are disconnected subgraphs. The values Louvain achieved are higher than 0.55, which indicates a strong community structure.
When we evaluate the execution time of the different algorithms, Louvain outperforms Infomap, while the execution times of Louvain and label propagation are very similar.
In addition to the previous tests on separate samples, we assessed the robustness of Louvain using the library robin [6] on the whole data of each subreddit.
To test the goodness of the communities detected by Louvain in each complete subreddit, we followed the first workflow from robin, in which the stability of the algorithm is compared against random perturbations of the original graph. As the stability measure we chose the VI (variation of information) metric. The results of this evaluation are shown in the next plots.
Stocks subreddit
Wallstreetbets subreddit
For both datasets the stability-measure curves for Louvain and the null model are very close. For the stocks subreddit the AUC values were 0.2631427 for the real data and 0.2494959 for the null model; for wallstreetbets they were 0.2357264 for the real data and 0.2471575 for the null model. Considering also the p-values on the right side of the plots, we can conclude that for the stocks subreddit the results of the Louvain community detection are statistically significant, which means that it is a suitable algorithm for this dataset. For wallstreetbets the result is statistically significant only for perturbations lower than 0.3, which is not ideal.
When we use robin to compare the two best algorithms from our modularity test (Louvain and label propagation), it is possible to confirm the previous findings: for both subreddits Louvain outperforms label propagation.
Considering these tests, we selected Louvain as the clustering algorithm for the rest of the analysis in our project. For the previous graph representation of the data, the following communities were created with the Louvain algorithm.
To analyze the content of the comments in each cluster, we inspect some random clusters for their most common words, which are displayed in the following wordclouds. Some clusters frequently use the same words, such as “stocks” and “buy”, but we also observed that each cluster has its own topics.
par(mfrow=c(2,2),
oma = c(0,0,0,0),
mar = c(0,0,0,0))
get_word_cloud_community(posts_stocks,communities(lc),2)
get_word_cloud_community(posts_stocks,communities(lc),5)
get_word_cloud_community(posts_stocks,communities(lc),8)
get_word_cloud_community(posts_stocks,communities(lc),10)
Exploratory Data Analysis
Tickers
While both subreddits talk mostly about the same tickers, the counts are significantly different. More important for the analysis are the mentions of the tickers per week; there should be no longer time frames without mentions.
It is evident here that the mentions of tickers vary greatly. In some weeks no tickers are mentioned at all, and the counts are mostly very low. This has to be accounted for during the final analysis.
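The weekly counts underlying these plots can be computed along these lines (a sketch, assuming the list-column `ticker` produced by `add_tickers_sentiment` and the `comm_date` column):

```r
library(dplyr)
library(tidyr)
library(lubridate)

weekly_mentions <- wsb %>%
  unnest(ticker) %>%                                     # one row per mentioned ticker
  mutate(week = floor_date(as.Date(comm_date), "week")) %>%
  count(week, ticker, name = "mentions")                 # mentions per ticker and week
```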
Sentiment
wsb_sent <- wsb %>% mutate(sentiment= recode(sentiment, "-1" = "Negative", "0" = "Neutral", "1" = "Positive")) %>% group_by(sentiment) %>% summarise(count=n())
stocks_sent <- stocks %>% mutate(sentiment= recode(sentiment,"-1" = "Negative", "0" = "Neutral", "1" = "Positive")) %>% group_by(sentiment) %>% summarise(count=n())
ggplotly(ggplot(wsb_sent, aes(x=sentiment, y=count))+
geom_col()+
labs(title="Count of sentiment categories for r/wallstreetbets"))
ggplotly(ggplot(stocks_sent, aes(x=sentiment, y=count))+
geom_col()+
labs(title="Count of sentiment categories for r/stocks", x= "Sentiment"))
Both subreddits show a similar distribution of sentiment categories: the majority of comments are neutral and do not add information about the users’ sentiment towards a stock.
User’s interaction in subreddits
We extracted the individual users that were active in each subreddit and tried to find out whether some users are active in both subreddits. The following plot presents the distribution of active users, where we found that more than half of the users are active in wallstreetbets (52.9%) and just 7% of the active users participate in both wallstreetbets and stocks.
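A sketch of how this overlap can be computed (assuming the `user` column of both data sets):

```r
users_wsb    <- unique(wsb$user)
users_stocks <- unique(stocks$user)

# users active in both subreddits, as a share of all active users
shared <- intersect(users_wsb, users_stocks)
length(shared) / length(union(users_wsb, users_stocks))
```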
Final Analysis
To answer the question whether the sentiment towards a specific ticker has an influence on market data, the cross-correlation between the sentiment and the percentage change of the price is used.
During the cross-correlation, the time series are lagged against each other, so that the values of one time series at \(t - \text{lag}\) are correlated with the values of the other time series at time \(t\).
Since the mentioned tickers are quite sparse, only the top 10 mentioned tickers are used, and missing sentiment values are handled with na.pass, so they are not considered in the calculation but simply passed through. Imputation of the missing values is not feasible either, since sentiment values are often unavailable for longer time spans.
The results are visualized using the ACF plots returned by the ccf function. They show the correlation of the two time series depending on the lag, including 95% confidence intervals.
Since applying a cross-correlation function assumes that both time series are stationary, this is ensured by de-trending the time series before the calculation using ndiffs and diff.
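A sketch of this per-ticker calculation (`cross_correlate` is a hypothetical helper, not the function in analysis_helpers.R; `ndiffs` is from the forecast package):

```r
library(forecast)

cross_correlate <- function(sentiment, price_change, max_lag = 5) {
  # difference both series until they are stationary
  d <- max(ndiffs(na.omit(sentiment)), ndiffs(na.omit(price_change)))
  if (d > 0) {
    sentiment    <- diff(sentiment, differences = d)
    price_change <- diff(price_change, differences = d)
  }
  # missing sentiment values are passed through instead of dropped
  ccf(sentiment, price_change, lag.max = max_lag, na.action = na.pass)
}
```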
The ACF Plots and summary tables for the top 10 tickers mentioned by the subreddits are shown below.
source("analysis_helpers.R")## Warning: package 'forecast' was built under R version 4.0.5
stocks_prep <- prepare_analysis(stocks)
wsb_prep <- prepare_analysis(wsb)
| -5 | -4 | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TSLA | 0.23 | 0.02 | 0.15 | 0.04 | 0.07 | -0.10 | -0.05 | -0.08 | 0.18 | -0.16 | 0.14 |
| PLTR | -0.02 | -0.06 | -0.06 | 0.17 | 0.04 | 0.33 | -0.02 | 0.10 | -0.25 | 0.11 | 0.06 |
| GME | 0.69 | 0.09 | 0.04 | -0.13 | -0.20 | 0.34 | 0.07 | -0.39 | 0.11 | -0.23 | 0.47 |
| NIO | 0.26 | -0.30 | 0.21 | -0.37 | 0.18 | -0.42 | 0.07 | -0.13 | -0.14 | 0.01 | 0.31 |
| AMC | -0.31 | 0.05 | -0.18 | 0.52 | -0.10 | 0.14 | -0.39 | 0.66 | -0.01 | 0.03 | 0.51 |
| AAPL | 0.16 | 0.03 | -0.04 | 0.21 | -0.08 | 0.03 | -0.05 | 0.06 | 0.00 | -0.07 | 0.22 |
| BB | 0.23 | 0.25 | -0.05 | 0.14 | -0.09 | 0.36 | 0.12 | 0.63 | -0.29 | -0.07 | 0.22 |
| BABA | -0.16 | -0.39 | -0.43 | 0.15 | 0.16 | 0.12 | -0.17 | -0.11 | 0.51 | 0.45 | 0.33 |
| AMD | 0.11 | 0.10 | 0.02 | -0.07 | -0.25 | -0.12 | -0.02 | 0.21 | 0.02 | -0.27 | 0.16 |
| NOK | 0.08 | -0.23 | -0.61 | 0.83 | -0.20 | 1.00 | -0.74 | 0.07 | -0.67 | -0.65 | 1.00 |
| rowMean | 0.13 | -0.04 | -0.10 | 0.15 | -0.05 | 0.17 | -0.12 | 0.10 | -0.05 | -0.09 | 0.34 |
The mean correlation at lag -2 is 0.22, so the mean sentiment of wallstreetbets on a given day could be used to predict the market movement two days later.
| -5 | -4 | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TSLA | -0.08 | 0.17 | 0.22 | -0.17 | 0.14 | 0.00 | 0.20 | 0.08 | -0.11 | 0.05 | 0.19 |
| AAPL | 0.03 | 0.09 | -0.25 | -0.05 | -0.13 | -0.08 | -0.05 | 0.02 | 0.10 | 0.14 | -0.11 |
| PLTR | -0.09 | 0.15 | 0.17 | 0.21 | -0.12 | 0.09 | -0.05 | -0.18 | -0.03 | -0.05 | 0.03 |
| NIO | -0.06 | 0.02 | 0.19 | 0.04 | 0.10 | 0.19 | 0.14 | 0.02 | 0.09 | 0.01 | 0.07 |
| MSFT | -0.01 | -0.01 | -0.14 | -0.09 | 0.19 | 0.02 | -0.08 | 0.20 | 0.01 | 0.01 | -0.08 |
| GME | 0.31 | -0.13 | -0.04 | -0.60 | 0.14 | -0.14 | 0.04 | -0.24 | -0.35 | -0.06 | -0.53 |
| BB | 0.17 | 0.46 | -0.07 | 0.05 | 0.05 | 0.19 | 0.20 | -0.01 | 0.19 | -0.02 | -0.09 |
| BABA | 0.07 | 0.02 | 0.02 | -0.42 | 0.06 | 0.03 | 0.30 | 0.22 | -0.23 | -0.16 | 0.07 |
| AMD | 0.17 | -0.06 | -0.26 | -0.08 | -0.09 | 0.06 | 0.12 | -0.38 | -0.11 | 0.04 | -0.14 |
| AMC | -0.26 | -0.03 | -0.08 | -0.02 | -0.10 | -0.14 | -0.13 | 0.09 | -0.07 | 0.25 | -0.17 |
| rowMean | 0.02 | 0.07 | -0.02 | -0.11 | 0.02 | 0.02 | 0.07 | -0.02 | -0.05 | 0.02 | -0.08 |
For the stocks subreddit, the mean correlation is close to 0 for every lag, so in general the sentiment could not be used to predict market movements.
These different outcomes probably result from the sparsity of the ticker mentions, so these conclusions should be taken with a grain of salt. For the same reason, an analysis of a specific community is not feasible; the data is simply too sparse. To remedy this, the extraction of the data would have to be done differently: only posts that mention a specific ticker in a given time frame should be extracted to combat the problem of sparse data.
References
[1] T. Di Muzio, “GameStop capitalism. Wall street vs. The reddit rally (part i),” ZBW - Leibniz Information Centre for Economics, EconStor Preprints, 2021. [Online]. Available: https://EconPapers.repec.org/RePEc:zbw:esprep:229951.
[2] A. Sharma, D. Bhuriya, and U. Singh, “Survey of stock market prediction using machine learning approach,” in 2017 international conference of electronics, communication and aerospace technology (iceca), 2017, vol. 2, pp. 506–509, doi: 10.1109/ICECA.2017.8212715.
[3] S. Papadopoulos, Y. Kompatsiaris, A. Vakali, and P. Spyridonos, “Community detection in social media,” Data Mining and Knowledge Discovery, vol. 24, no. 3, pp. 515–554, 2012.
[4] E. J. Ruiz, V. Hristidis, C. Castillo, A. Gionis, and A. Jaimes, “Correlating financial time series with micro-blogging activity,” in Proceedings of the fifth acm international conference on web search and data mining, 2012, pp. 513–522.
[5] U. Brandes et al., “On modularity clustering,” IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 2, pp. 172–188, 2008, doi: 10.1109/TKDE.2007.190689.
[6] V. Policastro, D. Righelli, A. Carissimo, L. Cutillo, and I. D. Feis, “ROBustness in network (robin): An r package for comparison and validation of communities.” 2021, [Online]. Available: http://arxiv.org/abs/2102.03106.